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QUANTILE ESTIMATION WITH ADAPTIVE IMPORTANCE 

SAMPLING 

By Daniel Egloff and Markus Leippold 

QuantCatalyst and University of Zurich 

We introduce new quantile estimators with adaptive importance 
sampling. The adaptive estimators are based on weighted samples 
that are neither independent nor identically distributed. Using a 
new law of iterated logarithm for martingales, we prove the con- 
vergence of the adaptive quantile estimators for general distributions 
with nonunique quantiles thereby extending the work of Feldman and 
Tucker [Ann. Math. Statist. 37 (1996) 451-457]. We illustrate the al- 
gorithm with an example from credit portfolio risk analysis. 

1. Introduction. We introduce a new sample-based quantile estimators 
with adaptive importance sampling. Importance sampling is a widely used 
technique for variance reduction to improve the statistical efficiency of Monte 
Carlo simulations. It reduces the number of samples required for a given level 
of accuracy. The basic idea is to change the sampling distribution so that 
a greater concentration of samples are generated in a region of the sam- 
ple space which has a dominant impact on the calculations. The change of 
distribution is then compensated by weighting the samples using the Radon- 
Nikodym derivative of the original measure with respect to the new measure. 
However, in a multivariate setting, it is far from obvious how such a change 
of measure should be obtained. 

Given its importance for practical applications, especially for risk man- 
agement in the finance industry, the literature on sample-based quantile 
estimation with variance reduction is rather sparse. 1 The focus of variance 
reduction schemes is almost exclusively geared towards the estimation of 
expected values. The reason might lie in the additional intricateness that 
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sample-based quantile estimators exhibit. The quantile function, viewed as 
a map on the space of distribution functions, generally fails to be differen- 
tiable in the sense of Hadamard. For certain distributions, the quantile may 
be nonunique. If the lower and upper quantiles of a random variable Y for 
a probability level a G (0, 1), defined as 

q a (Y) = mf{y\¥(Y<y)>a}, 
q a (Y)=sup{y\F(Y<y)<a}, 

are distinct, then the ordinary quantile estimator Y[ nQ ,j )n based on the order 
statistics Yi >n , . . . , Y nn of samples Y±, . . . , Y n is not consistent anymore. For 
the special case of independent and identically distributed (i.i.d.) samples 
Y±, . . . ,Y n , Feldman and Tucker [12] prove that Y^ na ^ n oscillates across the 
interval [q a (Y), q a (Y)] infinitely often. Using the classical law of iterated 
logarithm for sequences of i.i.d. random variables, they also show that con- 
sistency can be retained for the modified estimator Y u ( n ^ n if the function 
v(n) S N satisfies 

(1.1) (1 + k)y/2nloglogn< [na\ - u(n) < Kn 1 / 2 ^' 

for some positive constants 7, k, K. 

For the estimation of expected values with importance sampling, a com- 
mon procedure is to apply the change of measure suggested by a large devi- 
ation upper bound. Although this approach often leads to an asymptotically 
optimal sampling estimator, it can also fail completely, as shown in Glasser- 
man and Wang [14]. 

Our method for quantile estimation does not rely on large deviation prin- 
ciples. Instead, it is adaptive. Adaptive algorithms, but only for expected 
values and not for quantiles, are introduced in the work of [2] and [3]. They 
apply the truncated Robbins-Monro algorithm of Chen, Guo and Gao in [7] 
for pricing financial options under different assumptions on the underlying 
process. Robbins-Monro methods and stochastic approximation date back 
to the historical work of Robbins and Monro [30] and Kiefer and Wolfowitz 
[21]. See also [23] and the more recent references [5, 24] and [25]. 

Using an adaptive strategy to obtain a quantile estimator means that 
every new sample is used to improve the parameters of the importance sam- 
pling density. Therefore, we cannot rely on the results of Feldman and Tucker 
[12]. Our quantile estimators, derived from weighted samples, are neither 
independent nor identically distributed. However, we derive a new law of it- 
erated logarithms for martingales which allows us to prove the convergence 
of the adaptive quantile estimators for distributions with nondifferentiable 
and nonunique quantiles without requiring the i.i.d. assumption thereby ex- 
tending the result of Feldman and Tucker. 
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Our paper is structured as follows. In Section 2, we present the general 
setup and we introduce the notation. Section 3 gives a brief review of adap- 
tive importance sampling for estimating the mean. In Section 4, we start 
with the discussion of the metric structure underlying our adaptive algo- 
rithm. We then derive two theorems, Theorems 4.1 and 4.2, which extend 
Feldman and Tucker [12]. The proof of the theorems build on a new result 
for the law of iterated logarithms for martingales which we present in Theo- 
rem 4.4. Finally, in Section 5 we provide an application of our new quantile 
estimator which we borrow from credit risk management. All proofs are 
delegated to the Appendix. 

2. Setup and notation. Let <fe(x) be a probability density depending on 
a parameter 9 of a random variable X relative to some reference measure 
A, defined on a measurable space (X,T) with a countably generated cr-field 
T . We assume that the parameters 9 take their values in a metric space 
(&,d) for some fixed metric d and equip it with the Borel u-algebra B{&). 
For now, we do not have to further distinguish the parameter space and 
the set of densities {ipo(x) \ 9 £ 0}. For the expectation, under the measure 
ipg dX, we write 



(2.1) W(A)]= / f(x)Mx)d\(x) 

Jx 

and we define for all p, 1 < p < oo, 

C p {9) = {/: X -)■ R | / is ^-measurable, \\f\\ p 9>p = E e [\f(X)\P] < oo}. 

Let (fQ Q (x) be our reference or sample density. We assume that all densities 
in are absolutely continuous with respect to the reference density (fg (x) 
and we denote by 

(2.2) = 

the likelihood ratio or Radon-Nikodym derivative. In particular, w x '■ x i— > 
w x {9) is measurable for all 9 £ 0. If / S Ci(9q), we have 

(2.3) E e [w x (9)f(X)]=E eo [f(X)} V#G0. 
For p > 1, we introduce the (weighted) moments 

(2.4) m Lp (9)=E e [\w x (9)f(X)\P]=K 9o [w x (9r 1 \f(X)n 

We use the abbreviation rrif{9) = mf^{9) for the second moment and 

(2.5) aj(9) = Vex e [w X (e)f(X)] = m f {9) - E eo [f(X)} 2 
for the variance. 
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3. Review: Adaptive importance sampling for estimation of means. Be- 
fore we derive our adaptive quantile estimators, we start this section with a 
brief review of adaptive importance sampling for the estimation of means. 
Consider a function / £ Ci(9q). Static importance sampling estimates the 
expectation Eg [f(X)] by the weighted sample average 

1 n 

(3.1) %(nJ) = - J £ ( Wx i (0)f(Xi) 

i=i 

with Xi ~ ipg dX i.i.d. The usual error estimates based on the central limit 
theorem indicate that the most advantageous choice for 9 would be the 
variance minimizer 

(3.2) 9* = argmin<7?(#) = argminmj(#). 

6»e© 0e0 

Unfortunately, in most cases (3.2) cannot be solved explicitly. An alternative 
to the approach on the basis of large deviation upper bounds is an adaptive 
strategy. The solution 9* is estimated by a sequence (9 n ) n >o, generated, 
for instance, by a stochastic approximation algorithm of Kiefer-Wolfowitz 
or Robbins-Monro type. Replacing the fixed parameter 9 in (3.1) by the 
sequentially generated parameters (# n )n>o leads to the adaptive importance 
sampling estimator 

1 n 

(3.3) e a (n,/) = (0<-i)/(*i), 

i=l 

where X n ~ ipg n _ 1 (x) d\(x) is simulated from the importance sampling dis- 
tribution determined from the parameter 9 n -\. In contrast to static impor- 
tance sampling, the random variables wx n {9 n -\) and f(X n ) in (3.3) are 
neither independent nor identically distributed. However, we still obtain a 
martingale. 

Lemma 3.1. Let 9 n be a sequence of parameters and X n ~ (fo n -i dX. 
Define T n = <r(6 , ...,6 n ,Xi,.. .,X n ). Then, for f G C\{0 Q ), 

n 

(3.4) M n = Y,(vxM-i)f(Xi) ~ Eft, [/(*)]) 

i=l 

is a martingale with respect to the filtration F= (J- n )n>o- 

Proof. If (Z n ) n >o is a sequence of integrable random variables, then 

n 

(3.5) M n = Y,(Z t ~ E[Zi \Z U ..., 

i=l 



ADAPTIVE QUANTILE ESTIMATION 5 

is a martingale. The integrability of wXi(9i-i)f(Xi) and the martingale 
property for (3.4) follow from 

(3.6) E[w Xi (ei-i)f(Xi) | = E fli _> X4 (0 i _ 1 )/(X i )] = E flb [/(*)], 
where the second equality is a consequence of (2.3). □ 

A strong law of large numbers and a central limit theorem for (3.3) has 
been obtained in [2] by applying classical martingale convergence results for 
which we refer to [17] and [15]. For a proof of the theorem below, we refer 
to [2]. 

Theorem 3.1. Let 9 n , X n , and ¥= (J- n ) n >o be as in Lemma 3.1. As- 
sume that 9 n — >• 9* £ converges almost surely and that there exists a > 1 
such that for all 9 £ & 

(3.7) E e [\w x (9)f(X)\ 2a }<^, 

the function mf t 2a '■ 9 w/,2a(#) is continuous in 9*, and 

(3.8) E[m /|2a (0 n )] < oo V?i>0. 
Then 

1 " 

(3.9) lim - V w Xl {9 i - l )f{X i ) = E 6o [f(X)} almost surely, 



n 
i=i 



and 



(3.10) v 7 ^ ( - E w * " E «o lf( X )) 
where -4 denotes convergence in distribution. 



N(0,a 2 f (9*)), 



4. Adaptive importance sampling for quantile estimation. Having re- 
viewed the estimation of the mean with adaptive importance sampling in 
the previous section, we introduce now the metric structure and the algo- 
rithm that underlies our new adaptive quantile estimation. 



4.1. Riemannian structure for parameter tuning. The procedure to esti- 
mate the variance optimal parameter (3.2) crucially depends on the metric 
structure of the parameter space 0. The metric is not only important if 
it comes to the actual numerical implementation, but is also material to 
determine existence and uniqueness of a solution. 

Let the parameter space be a smooth manifold. It is known that the 
canonical metric on a family of densities {ip$ (x) \ 9 £ 0} is induced by the 
Riemannian structure given by the Fisher information metric 



(4.1) 



g = E e [dlx®dl x }. 
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Here, l x {9) = log <p$(x) is the log-likelihood function with differential 

(4.2) di x -.e^T*e, 

considered as a one- form on and with T*0, the co-tangent space of the 
manifold 0. In particular, (4.1) defines a nondegenerate symmetric bilinear 
form on the tangent space T0, hence a Riemannian metric. 2 

Having equipped the parameter space with a Riemannian metric, we 
can formulate the first order condition for (3.2) in terms of the Riemannian 
gradient V as 

(4.3) Vm/(0) = O. 

Under suitable assumptions on X and the likelihood ratio w x (9), we can 
exchange integration and differentiation to arrive at 

(4.4) Vmj(fl) = -Ee [f(X) 2 wx(e)Vl x (e)} = -K e [f(X) 2 w x (9) 2 Vl x (9)} 

with \7l x {9) the Riemannian gradient of the log- likelihood. To approximate 
a solution Vmj(fl*) = 0, we can now use the representation (4.4) and a 
stochastic approximation scheme 

(4.5) 6 n+ i = 9 n + -f n+1 H(X n+1 ,9 n ), X n+ i ~ (p 6n dX, 
with average descent direction 

(4.6) H(X, 9) = -f{Xfw x {9) 2 Vl x {9). 

In this paper, we want to keep the focus on adaptive importance sampling 
for quantile estimation and we therefore restrict ourselves to vector spaces. 
An example with a flat metric is provided by the Gaussian densities 

(4.7) = {iV(6>,£) | 9 e R k } 

with fixed covariance structure S. 3 The first and second order differentials 
of the likelihood l x {9) are 

(4.10) dl x {9) = H~ 1 (x-9), d 2 l x (9) = -£ _1 . 



2 For the basic concepts of Riemannian geometry, we refer to [20, 22] and the references 
therein. The usage of the Riemannian metric based on the Fisher information goes back 



3 A well-known example of a nonflat Riemannian structure on a space of distributions 



to [28] 

is the Fisher information metric of a location scale family of densities 
(4.8) = L M (x) = I^£^|( M)C r) G K x 



A second example is given by the space of all multivariate normal distributions 
(4.9) 

which is not flat anymore. 



(4.9) G = {N(6,Z) |6»eR fc ,EeS+(fc)} 
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Hence, the Fisher metric on is 

(4.11) 9v(u,v) = -E e [d 2 l x {e){u,v)}=u T ^ 1 v. 
Because 

(4.12) gz(Vl x (9),u) = Vl x {9) T T,- x u = dl x {9)(u) = {x- ^) T S~ 1 n, 
the gradient of the likelihood with respect to the metric (4.11) is 

(4.13) Vl x (9) = (x-9). 

Note that the gradient V of the Fisher metric defers by a factor of S _1 from 
the gradient induced by the standard Euclidian metric. 

4.2. Parameter tuning with adaptive truncation. In practical applica- 
tions, the parameter space is often noncompact or it is difficult to a 
priori identify a bounded region to which the optimal parameter must be- 
long. We therefore suppose that the parameter sequence (6* n ) n >o is generated 
by an algorithm that enforces recurrence and boundedness of the sequence 
9 n by adaptive truncation. A series of work [6-10] shows that stochastic ap- 
proximation algorithms with adaptive truncation behave numerically more 
smoothly and converge under weaker hypotheses. No restrictive conditions 
on the mean field or a-priori boundedness assumptions have to be imposed. 
We follow Andrieu, Moulines and Priouret [1] who analyze the convergence of 
stochastic approximation algorithms with more flexible truncation schemes 
and Markov state-dependent noise. To specify the algorithm, we let (/Cj)j £ n 
be an increasing compact covering of satisfying 

oo 

(4.14) = [j Kj,Kj C int(/Cj+i) 

3=1 

and 

(4.15) 7 =( 7n ) ngN , e=(e n ) n6N , 

two monotonically decreasing sequences. We introduce the counting vari- 
ables 

(4.16) (n n , v ni Cn)neN G N x N x N, 

where K n records the active truncation set in the compact covering, v n counts 
the number of iterations since the last re-initialization (truncation) and Q n is 
the index in the sequences 7, e introduced in (4.15). If v n / 0, the algorithm 
operates in the active truncation set K, Kn so that 



(4.17) 



9j € IC Kn Vj < n with Vj ^ 0. 
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If v n = 0, the update at iteration n caused a jump outside of the active 
truncation set JC Kn and triggers a re- initialization at the next iteration n + 1 . 
We assume that a stochastic vector field is generated from a measurable map 

(4.18) H X x 0^0, 

where X and are both equipped with countably generated cr-fields B(X) 
and 0(0), respectively. We also suppose that is an open subset of some 
Euclidian vector space. 

To handle jumps outside the parameter space 0, we introduce an isolated 
point 9 C taking the role of a cemetery point. Let = U {6 C }. For an 
arbitrary 7 > 0, we define a kernel on X x by 



Q 7 (x,8;Ax B) = / Pe{dy)l {e + 1 H( y ,e)eB} 

J A 

b} / Pe(dy)1-{0+ 7 H(y,e)<£B}i 

J A 



(4.19) 

+ 1 



where (1,9)6^x0 and A G B(X), B G B(&), and Pe(dx) is a measure on 
X parameterized by 9. For a sequence of step sizes 7, we define the process 
(X n ,9 n ) by 

(4.20) (X n+ i,9 n+ i) ~ Q Jn+1 (X n ,9 n ; •) 

unless 9 n = 9 C , in which case we stop the process and set 9 n+ \ = 9 C , X n+ \ = 
X n . The law of the nonhomogeneous Markov process (4.20) with initial 
conditions (x,9), represented on the product space (X x 0) N , is denoted by 
P^g. Let Xq C X be a compact subset 

(4.21) §:X x 0^ Xq x JC , 

be a measurable map and <p : N — > Z with 0(n) > — n. 



Algorithm 4.1. The stochastic approximation algorithm with adap- 
tive truncation is the homogeneous Markov chain defined by the following 
transition law from step n to n + 1 : 

(i) If v n = 0, then we perform a reset operation which starts in Xq x /Co 
and draws 

(X n+1 ,9 n+1 ) ~ Q JCn ($(X n ,9 n );dx x d9). 
Otherwise, we simulate 



(X n+ i,9 n+ i) ~ Q 7Cn (X n ,8 n ;dx x d9). 
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(ii) If ||0 n +i - #n|| < e Cn an< i ^+1 G tnen we update 

^n+l = V n + 1, Cn+1 = Cn + 1) K n+1 = ^n! 

else we prepare for a reset operation in the next iteration by setting 

^n+l=°, Cn+1 = 4>(Cn), K n+1 = K n + l. 

The convergence of Algorithm 4.1 under suitable conditions on the mea- 
sure Pg(dx), the mean field h defined as 

(4.22) h(6)= [ H(x,e)P e (dx) 

Jx 

and the sequences 7, e are established in [1]. 

4.3. Quantile estimation. After having introduced the metric structure 
and the parameter tuning in the previous sections, we can now turn our 
focus to the estimation of quantiles of a real- valued random variable 

Y = V(X), V:X^R, 

defined in terms of an ^-measurable function ^. We denote by 

q a = q a (Y)=mf{y\F(Y<y)>a}, < a < 1, 

the lower a-quantile of Y. Furthermore, let (6 n )n>o be a sequence of tuning 
parameters. In favor of a more compact notation, we introduce the abbrevi- 
ations 

(4.23) w n = w Xn (0n-i), Y n = V(X n ), n>l. 

We recall that, under the assumptions of Theorem 3.1, the weights w n satisfy 

1 n 

(4.24) E[w n I F n -i] = E 9n _ 1 [w n ] = 1, - ->• 1 almost surely. 

1=1 

We first consider generalizations of the empirical distribution function to 
weighted samples. Because the sum of the weights Y17=i Wi ls no ^ neces ~ 
sarily normalized to one, we introduce the renormalized weighted empirical 
distribution function 

1 - 

( 4 -25) F n;W (y) = = ^2wil {Yi < y } 

2^=1 W i i=1 



4 In fact, [1] treat the more general case of state dependent noise where the measure 
Pg(dy) takes the form of a Markov kernel Pe(x,dy). 
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and set 

(4.26) F nyW , l/ (y) = -}—F n , w (y), 

where v : N — > M + is a normalization function, which we determine later to 
prevent the oscillation of the quantile estimators. We can use the increasing 
function F n>wv to define the quantile estimator 

(4.27) q n ,wA a ) = F n,w,A a ) = inf {y I F n,wAv) ^ a }> 

where F^~ w v is the generalized inverse of F n>WiU . Besides the re-normalized 
weighted empirical distribution function (4.25), there are alternative ways 
to generalize the empirical distribution function to weighted samples. For 
example, 



1 n 

( 4 - 28 ) F UM = -7-,Y. 



Wjli 

i=i 



v(n) Z 



puts the emphasis on the left tail of the distribution. However, if the concern 
is on the right tail, then 



(4.29) 



F n,v,M = ^E^W<»} + " ^ X>J 



n 

— T 



Wjl 



i L {Yi>y} 



v ; i=i 

is the proper choice. We denote the corresponding quantile estimators by 
(4.30) q l n, w ,v(u) = F^ >v (a), q r njW)V (°t) = F n%,A a ^ 

The functions (4.26), (4.28) and (4.29) are no longer genuine empirical distri- 
bution functions because conditions lim^-oo F{x) = and lim^oo F(x) = 
1 may be violated. However, we still have 

lim F nw v (y) = 0, lim F l (y) = 0, lim K w Jy) = 1. 

y— >~oo y—t—oo ' ' y— >oo ' ' 

For studying the convergence of the weighted quantile estimators, we as- 
sume that the sequence (# n )n>o is generated by any adaptive algorithm as 
described in Section 4.2 which converges to some limit value 6*. We would 
like to point out that it is not required that 9* is the solution of a vari- 
ance minimization problem such as given by (3.2). Later, we will propose a 
specific tuning algorithm and state verifiable conditions that guarantee its 
convergence. 
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Assumption 4.1. (ZCj)jgpj is a compact exhaustion of the parameter 
space as in (4.14). The sequence (0 n )n>o satisfies 

(4.31) n ->■ 0* E almost surely. 
For any p E (0, 1), there exists a constant C(p) such that 

(4.32) P(sup Kn >j) <C(p)p>, 

v n>l 7 

where K n is the counter of the active truncation set of (0 n )n>o defined in 
such a way that (4.17) holds. For some p* > 1, there exists W E L p * (6o) such 
that for any compact set K. C 0, 

(4.33) l {em w x (9) < C p *(dinm(lC))W(x), 

where C p * (diam(/C)) is a constant only depending on p* and the diameter 
of /C. The compact covering (4.14) is selected such that 

(4.34) C p . (diam^j)) < e V+"Vi 
for some positive constants k p *, m p *. 

Because of (4.32), the number of truncations remains almost surely finite 
and every path of 9 n remains in a compact subset of 0. However, this 
does not imply that there exists a compact set K* such that 8 n E K* almost 
surely for all n. 5 Condition (4.33) guarantees the continuity of moments as a 
function of the parameters 8. Condition (4.34) provides a growth restriction 
on the compact exhaustion (4.14). 

We first address convergence when quantiles are unique but without im- 
posing differentiability of the distribution function at the quantiles. 

Theorem 4.1. Assume that the distribution function F(y) = ¥(Y < y) 
is strictly increasing at q a . Under Assumption 4-1, 

Qn,w,u(ci) — > q a almost surely (n — > oo), 
for the normalization function u(n) = 1, and 

€i,w,u{ a ) Qa, q l n ,w,u{ a ) ~> Qa almost surely (n -)• oo), 
for is(n) = n. 

If the quantiles are not unique, a proper choice of the normalization func- 
tion v(n) eliminates the oscillatory behavior and leads to consistent estima- 
tors. For notational convenience, let 

(4.35) v = a\{9*) and v a = ) = ^ (fciao) -»(^ )■ 



5 We would like to thank the anonymous referee for pointing this out to us. 
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Theorem 4.2. Suppose the conditions in Assumption 4-1 are satisfied. 
If there exists r\ > 0, k > 0, and < 7 < \ such that 

n - kn 1 / 2 ^ 
n — (1 + rj) a/ 2m; log log(n-u) 

(4.36) 

n- (1 + r?)/a y/2nv a log log(nf Q ) 

< z/(n) < — , 

n + (1 + r/)\/2ra;loglog(ra;) 

then 
(4.37) 

Qn,w,u(ct) — ^ Qa almost surely (n — > 00). 
// there exists n > 0, fc > 0, and < 7 < \ such that 

(4.38) n + \^-J2nv a \og\og{nv a ) <u(n) <n + kn 1/2+J , 

1 — a 

then 

(4.39) q r n w u (a) — > q a almost surely (n — > 00). 
If there exist 77 > 0, k > 0, and < 7 < ^ smc/i that 

(4.40) n — kn 1 / 2 ^ < u(n) < n -J 2nv a log log (nv a ) , 

a 

then 

(4.41) q l n w u (a) — > q a almost surely (n — > 00). 

The proofs of Theorems 4.1 and 4.2 are given in Section 6. They rely on a 
law of iterated logarithm for martingales which we present in a later section 
(Section 4.6). 

The normalization functions used in Theorem 4.2 are difficult to imple- 
ment, because v a depends on the unknown quantile q a and the unknown 
limit parameter 9*. In this regard, the following corollary is helpful. 

Corollary 4.1. If 9* = argmin crf )Q ^{9), then 

< "l (talOo) o*(0o) = n (*(X) > Qa) ~ P«o(*(*) > Qa)" < 3- 

Therefore, the conclusions of Theorem 4.2 hold, if v a in conditions (4.36), 
(4.38), and (4.40) is replaced by \. 

To compare Theorem 4.2 with Theorem 4 of Feldman and Tucker in [12], 
we state here a refined version of their result. 
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Theorem 4.3. Let Yi, n , • • • ,Yn,n be the order statistics of i.i.d. samples 
Y±, ... ,Y n of a random variable Y . Let 

(4.42) w a = P(y < q a (Y)) - F(Y < q a (Y)f . 
If the normalization function v(n) € N satisfies 

(4.43) (1 + k) v / 2w a n log log n < [na\ - v(n) < Kn l/2+1 
with j,k,K positive constants, then Y u ^ n ^ n — > q a (Y) almost surely. 

We omit the proof, as it is similar to the proof of Theorem 4.2. Condition 

(4.43) is now expressed in a way that allows a direct comparison with (4.38). 
We recall the original condition in Theorem 4 of Feldman and Tucker, 

(4.44) (1 + k)y/2nlog\ogn < [na\ - v{n) < Kn 1/2+1 , 

which apparently does not depend on the variance of the tail probabilities. 
However, because w a < \ we see that (4.43) is indeed a weaker assumption 
than (4.44) used in [12]. 

4.4. Sequential parameter tuning for quantile estimation. We still have 
to provide a strategy to determine the limit parameter 6* and the construc- 
tion of an approximation sequence 9 n converging to 8* almost surely. For the 
estimation of the expected value E[/(X)], Theorem 3.1 suggests that the op- 
timal parameter 8* is the variance minimizer 8* = argmin e (Tj(^) which can 
be estimated by a stochastic approximation algorithm. However, for quan- 
tile estimation, the choice of an optimal parameter 8* is less obvious. If the 
distribution function F of the random variable Y = *&(X) is differentiable in 
a neighborhood of q a , the functional delta- method applied to the empirical 
process (see, e.g., Corollary 21.5 of [33]) suggests to minimize the variance 
of the weighted tail event wx(8)l{^(x)>q a } such that 

(4.45) 8* = Mgmmmi (to)Oo) o*(0). 

6 

Instead of arguing with the delta-method as above, we can also use The- 
orem 4.2 to motivate the choice (4.45) even in the most general situation, in 
which quantiles may not be unique. For instance, let us consider the quan- 
tile estimator q r n w v . The bounds for v(ri) in (4.38) lead to a bias for q r n w v . 
To minimize this bias, we must ensure that v(n) is as close as possible to 
n while, at the same time, satisfying condition (4.38). This means that we 
must select v a such that the term \J 2nv a log log (nv a ) becomes as small as 
possible. From the definition of v a in (4.35), we see that the parameter 8* 
satisfying (4.45) provides the smallest value for v a . The same arguments 
hold for q l n w y . 



14 



D. EGLOFF AND M. LEIPPOLD 



For the estimator q n ,w,v{ot) denned in (4.27), we must keep v(n) as close as 
possible to 1 in order to minimize the bias. From condition (4.36), we see that 
we must not only minimize \J 2nv a log log (nv a ) , but also \Jlnv log log(m>). 
Hence, for q n ,w,v( a ) we have to choose 6 to make both the variance of the 
weighted tail event Wx(&)l{iB(x)>q a } an d the variance of the weights wx{9) 
as small as possible. 

Unfortunately (4.45) is not constructive either because the quantile q a is 
not yet known and must be replaced by a suitable estimator. Suppose now 
that we could find a rough estimate q a for the quantile q a ; then the scheme 
(4.5) based on the stochastic gradient, 

(4.46) H 4a (X n+1 ,e n ), X n+1 ~ <p dn dX 
with 

(4.47) H q (x,6) = -l mx)>q} w x (9) 2 Vl x (9) 

could be used to generate a sequence (# n )n>o approximating the solution 
9* for the first order condition Vmi (fc oo )0*(^*) = 0. However, if q a is an 
extreme quantile, the simulated values for the stochastic gradient (4.46) 
would be mostly zero for parameter values 9 n close to the starting value 6q. 
Even worse, if the simulation produces a nonvanishing stochastic gradient, 
it is generally very inaccurate and could drive the parameter values to a 
wrong region of the parameter space. As a consequence, the convergence 
rate of the algorithm is very poor. It freezes at an early stage and one 
might be tempted to use a sufficiently large step size. However, in practical 
applications, compensating an erratic stochastic gradient with a large step 
size is not a solution, as it increases the risk that the algorithm fails to 
converge. 

A simple and practically very efficient approach is to gradually bridge 
from a moderate tail event to an extreme tail event during the simulation. 
More precisely, let 

(4.48) M quq2 (9) = b{n)m 1{qi ^(9) + (1 - b(n))m 1{q2oa)0 ^(9) 

with b(n) weighting functions depending on the sample index n. The val- 
ues qi are selected such that q a £ [(71,(72]- We choose q\ such that {^(X) > 
q±} is a moderate tail event. Hence, the corresponding stochastic gradient 
H gi (X n+ i,9 n ) can be estimated with sufficient accuracy for 9 n in a neigh- 
borhood of 9q. The value q% is selected in the range of q a or even larger. 
A preliminary simulation or some initial samples can be used to obtain a 
crude estimate for q a , including a confidence interval. The function b(n) 
is assumed to converge to zero as n— >oo. A suitable choice would be, for 
example, b(n) = l/log(n + 1) which decays sufficiently slow such that the 
component (4.46) of the stochastic gradient from q\ drives 9 n towards a 
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solution for the extreme tail event. Stochastic approximation with adap- 
tive truncation can then be used to generate a sequence of parameters 9 n 
converging to 

(4.49) 9* = arg min M quq2 {9) 

e 

as we will see below. 6 

4.5. Verifiable convergence criteria. Each of the above criterion is based 
on a stochastic vector field generated by a map H(x, 9). For instance, in case 
of (4.48), we have 

(4.50) H(x, 9) = b{n)H qi (x, 9) + (1 - b{n))H q2 (x, 9). 

We provide verifiable conditions on H(x, 9), its mean field, and the sequences 
7, e, which imply the convergence of Algorithm 4.1 for state independent 
transition kernels. To this end, we introduce for any compact set /C C the 
partial sum 

n 

S/ jn (7,e,/C) = l {(T( x: )e )> n} ^7 i: (F(Xj fc ,6' fe _i) - h{9 k ^)), 

k=l 

(4.51) 

1 <l <n, 

where cr(JC, e) = a(JC) A v{e) and a(fC) and v{e) are the stopping times 

(4.52) cr(/C) = ml{k> 1 | 9 k (£1C}, 

(4.53) i/(e) = inf{A; >l\\9 k - 9 k ^\ > e k }. 
If a= (a;)/ g N is a sequence, we write 

for the sequence shifted by the offset k. 



Assumption 4.2. The parameter set is an open subset of R d . For 
some p > 1, there exists a function W £ C p (9q) such that for every compact 
set K, C 0, 

(4.54) sup sup ^wf* <C K <oo 

xexe&c w x (6)W{x)p 



6 Yet another approach is to sequentially update an estimator q a for the quantile along 
the simulation as well to improve the upper value q2 in (4.49), leading to a coupled stochas- 
tic approximation scheme for the parameters (9 n ,q n ). A sequential quantile estimator has 
been proposed in [32] (see also [31]). Because the quantile estimator does interfere with 
the update scheme for the tuning parameter n , the convergence of the joint parameter 
set is more subtle. 
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with Cjc a constant only depending on K,. The mean field 
(4.55) h(9) = E 6 [H(X,e)] 

is continuous and there exists a C 1 Lyapunov function w : — > [0, oo) satis- 
fying the following conditions: 

(i) There exists < Mo < oo such that 

£ = {9e@ \ (h(9),Vw(9)) = 0} C {9 G | iv{9) < M }. 

(ii) For M > 0, let W M = {6> G | w{9) < M}. There exists M x G (M , oo] 
such that Wmi is a compact subset of 0. 

(iii) For any 9 G \ £ it holds that (h(9),Vw(9)) < 0. 

(iv) The closure of w(C) has empty interior. 

The sequences 7, e are nonincreasing, positive, and satisfy e n — > 0, 



(4.56) J> = °°» EK+f-j) 

n=0 n=0 V V n ^ / 



< OO. 



The existence of a Lyapunov function in (i) simplifies, if h = Vm is a 
gradient field of a continuously differentiable function m. In this case, we can 
choose w = rn. The next result is along the lines of Proposition 5.2 in [1]. Its 
proof is similar to the proof of [1], Proposition 5.2, but less involved because 
we consider only state-independent transition probabilities. Therefore, we 
do not need to consider the existence and regularity of the solution of the 
Poisson equation. The convergence of the algorithm is then a consequence 
of [1], Theorem 5.5. 

Proposition 4.1. Let 

A(5,M, 1 ,e)= sup hl ( JsvLp\\S 1 , n ('j,e,WM)\\>S 

(x,e)GX xK. 1 y ' J v n>l 



(4.57) 



+ P ^)(^)<Wm)}. 



If /Co C Wm > then for every M £ [Mo, Mi) there exist no, #o > 0, and a 
constant C > such that for all j > no, 

(4.58) F(su P K k >j) <C (sup A(6q,M, 7 ^,e^ 

K k>l J V fc>n 

Under Assumption we have for every M G [Mo, Mi) and 5 > 0, 

(4.59) lim A(5, M, ■~f^ k ,e*~ k ) = 0. 

fc— >oo 

In particular, the key requirements, (4-31) and (4-32), of Assumption 4-1 
are satisfied. 
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To completely specify the stochastic approximation algorithm, we first 
have to make some selections for the initial parameter 6q. Because our target 
criterion puts more emphasis on a moderate tail event at the beginning of 
the simulation, it is sensible to start with the reference density. Alternatively, 
we can start with a large deviation approximation. 

The performance of a stochastic approximation algorithm usually depends 
strongly on an appropriate selection of the step size sequence. However, with 
the bridging strategy in (4.48), our algorithm is considerably less sensitive 
to the choice of the step size parameters. Since the sequence of step size 
parameters -y n must satisfy condition (4.56), we simply set 

(4.60) 7n , 



n + l 

and select e n accordingly to satisfy the second condition in (4.56). The pa- 
rameter a serves as a tuning parameter. A practical approach is to follow 
a greedy strategy which starts with a large value for a and reduces it after 
each re-initialization. Alternatively we can determine a by some step size 
selection criteria based, for example, on an approximation of the Hessian of 
the target criterion. 

The algorithm can be further robustified by Polyak's averaging principle. 
The idea is to use a large step size 7 n of the order n -2 / 3 which converges 
much slower to zero than n _1 but is still fast enough to ensure convergence. 
The larger step size prevents the algorithm from freezing at an early stage 
of the algorithm far off the local minimum. Polyak and Juditsky show in 
[27] that the averaged parameters converge at an optimal rate. 

4.6. Law of iterated logarithm for martingales. Before we discuss an ap- 
plication for our adaptive quantile estimator, we present the law of iter- 
ated logarithm for the sequence of martingale differences wx n (&n-i) f {X n ) — 
E[/(X)] which we require as an ingredient for the proofs of Theorems 4.1 
and 4.2. We state the main result below and present the proof in Section 
6. We use the following notation. If M n is a square integrable martingale 
adapted to a filtration (J r n ) n >o with AMj = Mj — Mj_i, then we denote the 
predictable quadratic variation by 

n 

(4.61) (M) = 0, (M) n = ^2nAMf\ J^], n > 1, 

i=i 

the total quadratic variation by 

n 

(4.62) [M] = 0, [M] n = ^AMf, n > 1, 

i=i 

and by s 2 = E[AM 2 ] the total variance. 
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Theorem 4.4. Suppose the conditions in Assumption 4-1 are satisfied, 
and let f : X — >M. be a measurable function in C p {9q). Assume that 

(4. 63) *MAA >i , 

p+p* 

where p* is from condition (4-33). Let 

(4.64) w x :9^w x (9) 

be continuous in 6* for almost all x € X . Define 

(4.65) Z n = w Xn (6 n „ 1 )f(X n ) - E[f(X)}. 
Then M n = Ya=1 ^ s a s Q uare integrable martingale and 

(4-66) ^-jMy n =^ 



n— ¥oo 



S 



2 



(4.67) lim = {m } {6*) - E[/(X)] J ) = a 



' f\ 

n— >oo n 



Moreover, if we let (j)(t) = \Jlt log log(t), then 

(4.68) limsup(/»(VF n )- 1 M n = +1, 

n— >oo 

(4.69) liminf^(VF n )- 1 M n = -l, 

n— >oo 

almost surely, where the weighting sequence W n is given by either W n 
[M] n , W n = (M) n , orW n = s 2 n . 



5. Applications. We next provide an explicit example for our adaptive 
quantile estimator and compare it to crude Monte Carlo simulation. We 
borrow our application from the financial industry, more precisely from 
portfolio credit risk. The so-called Value at Risk (VaR) is by far the most 
widely adopted measure of risk and represents the maximum level of losses 
that can be exceeded only with a small probability. This quantile-based risk 
measure is of particular importance to market participants and supervisors. 
For credit risk, supervisors require banks to calculate the credit VaR as the 
99.9% quantile of the loss distribution. 



5.1. Importance sampling for portfolio credit risk. The aim of portfolio 
credit risk analysis is to provide a distribution of future credit losses for 
a portfolio of obligers based on historically observed losses and possibly 
combined with market views. In a simplified setting, the outstanding credit 
amount for each obligor i = 1, . . . ,m is aggregated to a net credit exposure 



ADAPTIVE QUANTILE ESTIMATION 



19 



Cj. Defaults are tracked over a single period. At the end of the period, the 
portfolio loss is 

m 

(5.1) C = Y^CiYt, 

i=i 

where Y\ ~ Ber(pj) are the default indicators. For portfolios of illiquid com- 
mercial loans or corporate credits, the exposures c, are generally assumed 
to be constant which gives rise to a discrete loss distribution. The quan- 
tiles are nondifferentiable and not unique. Hence, to construct an adaptive 
importance sampling algorithm, we can rely on the results of the previous 
sections. 

For our application, we start from a Gaussian copula framework (see, e.g., 
[11]), in which the default indicators are modeled as 

( 5 - 2 ) Y i = l{A i e{- 00 ,e i \}- 

The credit quality variable Ai is given by 

(5.3) Ai = yjl - v 2 s{i) X s ^ + Vs^Ci, i = l,...,m, 

for some classification function s : {1, . . . , m} — > {1, . . . , k}. Usually in credit 
risk management, the m obligors are classified into k industry sectors. The 
default thresholds 9i are calibrated to match the obligors' default proba- 
bilities. The common factors X = (X s ) s =i,...,k ~ -^(0, E) are multivariate 
Gaussian. The idiosyncratic part e = (€i)i=i,...,m ~ -^(0, l m ) is independent 
from X. We restrict ourselves to the adjustment of the mean of the common 
factors X and keep the covariance structure E fixed. We note that impor- 
tance sampling on the common factors can also be combined seamlessly with 
importance sampling on the idiosyncratic variables e = (ei)«=i,...,m- 7 

Given the above setup, we are in the setting of Section 4.3 with Y = C = 
and * : R k -> R given as 8 



(5.4) *(x) = J2 c i 1 



{J 1 - V lli) X a(i)+ v s(i)^i<Qi}' 

i=l v 

For the implementation of the adaptive importance sampling scheme, we 
use the criterion (4.48) and determine the values for q±, q2 as described 
in Section 4.4, that is, we start with a moderate q\ and choose q2 by an 
educated guess in the region of interest. 



7 For instance, [13] and [26] apply an exponential twist to the conditional default indi- 
cators Yi | X ~ Ber(pi(X)) where Pi{x) = — yl — v \i) x s(i)) / v a(i)), is the conditional 
default probability. 

8 For notational convenience, we drop the dependency of $ on e as it is not affected by 
the importance sampling scheme. 
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5.2. Verifying convergence criteria for Gaussian distributions. For our 
credit risk application, assume a fixed covariance structure, and endow the 
Gaussian distributions (4.9) with the Fisher information metric g-% in (4.11). 
Before we can proceed, we need to make sure that Assumptions 4.1 and 
4.2 hold in our setup. The noncompactness of the parameter space and 
exponentially unbounded likelihood ratios call for adaptive truncation and 
make it a challenging test case, even though the Gaussian distributions have 
many special analytical properties. From the expression 

w x (9) = esxp(-g s (x,e) + ^g s (9,e)) 

for the likelihood ratio, it follows that 

(5.5) w x (e)<expr^g^(e,e)^exp(^{x,x)^ Wi>l. 

The verification of Assumptions 4.1 and 4.2 for the Gaussian distributions 
is now a straightforward consequence of (5.5) and Holder's inequality. 

Lemma 5.1. If \\f\\e ,h < 00 f° r some h > 2, we can exchange differen- 
tiation and integration to obtain Vrrif(9) = Eg[H(X, 9)] with 

H(x,9) = (9-x)f(x) 2 w x (9) 2 . 

The Hessian with respect to the Fisher information metric gz, given by 

(5.6) V 2 m/(0) = E e [(id fc +(9 - X)(9 - X) T )f(X) 2 w x (9)% 

is positive definite. IfP(f(X) > 0) > 0, then mf(9) — > oo for g^(9,9) — > oo. 
In particular, there is a unique minimizer 

(5.7) 9* = argminm/(0) e R k . 

e 

Moreover, for some p > 1 there exists W £ C p {9q) satisfying (4-33) and 

(4.54). 

The parametrization (4.9) works rather well if the ratio of the largest and 
smallest eigenvalue of £ is not too far away from one and the dimension of £ 
is not too large. For many practical applications, the correlation ellipsoid is 
very skewed. The first few principal components explain most of the variance 
and the last few are negligibly small. Even though the metric defined in 
(4.11) properly respects the covariance structure, and we use the gradient 
relative to this metric, we require a suitable dimension reduction. Therefore, 
we translate the mean in the span of the eigenvalues of the first few principal 
components. Let £ = UAU T where U is the orthogonal matrix with columns 
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given by the eigenvectors, and A is the diagonal matrix of eigenvalues. We 
write 

(5.8) Ji :R l ^R k 

for the embedding of M. 1 into M. k ; that is, J/ sets the last k — I coordinates 
to zero with corresponding projection jj :M. k — > PJ. Let 

(5.9) ®i = {N(UJi(a), S) | a G M. 1 }. 

The first and second order differential of the likelihood l x (a) is 

(5.10) dl x {a) = jJhT x {U T x- J t a), d 2 l x (a) = — J T A _1 J. 
Hence, the Fisher metric on is 

(5.11) g a (u,v) = -E a [d 2 l x (a)(u,v)} =u T J T A^Jv. 
Because 

g a (Vl x (a),u) = Vl x {a) T Jj k- 1 J l u = dl x (a){u) = {x 1 U - a 1 jj )K~ 1 J lU 

and x T UA~ 1 Jiu = x 1 U JiJj A -1 Jjit, the gradient of the likelihood with re- 
spect to the metric (4.11) is 

(5.12) Vl x {a) = (jJU T x-a). 

We adapt Lemma 5.1 to the parametrization given in (5.9). 

Lemma 5.2. If \\f\\$ 0t q < oo for some q > 2, we can exchange differenti- 
ation and integration to obtain Vm^(a) = E, 9 ^[H(X,a)] with 

H(x, a) = (a - jJU T x)f(xfw x {9{a)f 

and 9(a) = UJi(a). The Hessian with respect to the Fisher information met- 
ric g a , given by 

V 2 m f (9) = E,[(id ; +(a - jJU T X)(a - jju 1 ' X) T )f{Xfw x {e{a))\ 

is positive definite. If P(/(X) > 0) > 0, then mj(a) — > oo for g a (a, a) — > oo. 
In particular, there is a unique minimizer 

(5.13) a* = argminmj(a) G M- 1 . 

a 

Moreover, for some p > 1 there exists W G C p (0q) satisfying (4-33) and 

(4.54). 
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5.3. Numerical example. We consider a set of 2000 obligors with default 
probabilities comparable to a typical loan portfolio. We assume that the 
portfolio risk is driven by m = 14 industry factors, but restrict our analysis 
using only the first two principal components which already explain 84% of 
total variance. In the current regulatory framework as promoted by Basel II, 
credit risk (as well as operational risk) needs to be calculated at the 99.9% 
quantile of the loss distribution. Performing a crude Monte Carlo (MC) sim- 
ulation, we see that the loss (expressed in percentage numbers) at the 99.9% 
quantile lies somewhere around 0.2. This crude estimate allows us to make 
an educated guess for the parameters q± and q2 required for our adaptive 
importance sampling (AIS) estimator. We set q\=0.1 and q2 = 0.23. Instead 
of using the MC estimate as a starting point, we could also first do an AIS 
simulation with some arbitrarily set q\ and q2 to find some more appropriate 
numbers in a second simulation. Our numerical experiments indicated that 
the algorithm is not very sensitive to these approximate choices. Indeed, we 
just have to guarantee that we choose q\ small so that the initial step sizes 
are large enough. To clarify this point with an example, we find that we 
get almost identical results for q\ = 0.01. More precisely, with a fixed seed 
for the random number generator we get an estimate for 99.9%-quantile of 
0.2271 with qi = 0.1 and 0.2276 with q x = 0.01. 

Based on a sample of 10,000 draws, Figure 1 shows the convergence of 
the mean shifts for our AIS algorithm. The solid line represents the path for 
the step size of order re -2 / 3 . The dashed line represents the averaged values 
based on Polyak's averaging principle. We observe that the shift in the first 
principal component, which explains 75% of total variance, is substantial. 
The shift in the second component, which explains an additional 9%, is only 
very small. 

Figure 2 plots the cumulative distribution function for the right tail of 
the distribution. In contrast to standard MC simulation, our AIS algorithm 
provides a very smooth distribution function. Therefore, we can expect a 
considerable reduction for the variance of our quantile estimators. To sub- 
stantiate this conjecture, we additionally perform 1000 independent quantile 
estimations. In Table 1, we report the results for the standard Monte Carlo 
simulations to calculate F^{a) and for our AIS algorithm using the quan- 
tile definition in (4.30) which is based on the weighted empirical distribution 
Fnwviv)- The first column shows the different loss levels at which we sim- 
ulate the quantiles. The next two columns report the mean values for the 
estimations F£\{a) and Fn^ lV (a), respectively. The final row reports the 
variance ratio defined as the variance from the MC simulation divided by 
the variance of the AIS estimator. When we compare the variances of the 
estimators, we observe that for the region of interest, that is, around the 
99.9% quantile, our AIS estimator outperforms the result from the MC sim- 
ulation by a factor of around 20. This number increases further to more than 
112 when we look at the 99.99% quantile. 
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Panel A 




8.6 



Fig. 1. Parameter convergence from Polyak's averaging principle. The graph illustrates 
the convergence of the means of the first two principal components. The solid line represents 
the convergence for step size of order n~ 2//3 . The dashed line represents the averaged values. 
Panel B zooms in the rectangle area marked in panel A. 

6. Proofs. For the proofs for Theorems 4.1, 4.2 and 4.4 we start with 
collecting the basic properties of the generalized inverse of an increasing 
function. 9 



Lemma 6.1. Let F be a right continuous increasing function. Then, the 
generalized inverse 

(6.1) F*~(<*) = inf{a; I F(x) > a} 

is increasing and left continuous, and we have 

F(x)>a <^ F^(a)<x; 

F(x)<a <^ F^(a)>x; 
F(xi) < a < F(x 2 ) O xi < F^ (a) < x 2 ; 



9 See for instance [29], Section 0.2. 



24 D. EGLOFF AND M. LEIPPOLD 

100 

99.98 
99.96 

^ 99.94 

c 
g 

■C 99.92 

c 

o 

■g 99.9 

To 

a 99-88 
> 

| 99.86 

o 

99.84 
99.82 
99.8 

0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 

loss level 

Fig. 2. Cumulative distribution function of Monte Carlo and AIS simulation. The graph 
plots the cumulative distribution function using a Monte Carlo simulation and the AIS 
algorithm based on 10,000 samples. Our area of interest, the 99.9% quantile, is marked 
with a dotted line. 

F(F < ~ (a)) > a, with equality for F continuous; 
F <r '(F(x)) < x, with equality for F^ increasing; 
F continuous <^ F^ increasing; 
F increasing 44> F*~ continuous. 

6.1. Proof of Theorem 4-4 (iterated law of logarithm). Let K n denote the 
active truncation set for 9 n and 

(6.2) = lim K n , 

which exists because of assumption (4.32). We have 

0i G K, Kn Vz < n with V{ ^ 0. 
To get rid of the condition ^ 0, we decompose M n into 

n n 

(6-3) M n = J>1{^0} + 

i=l i=l 
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Table 1 

Quantile estimates for standard MC and AIS simulation at different loss levels based on 
1000 simulations. Quantiles are expressed in percentage numbers, and the variance ratio 
is defined as the variance of F n i{a) divided by the variance of Fn w V (ot) 



Mean (in %) 



Loss level A 


F r '^ (a) 




Variance ratio 


0.1000 


98.2449 


98.2217 


1.7988 


0.1100 


98.6193 


98.6011 


2.0759 


0.1200 


98.9099 


98.8943 


2.6251 


0.1300 


99.1366 


99.1238 


3.0764 


0.1400 


99.3162 


99.3053 


3.7290 


0.1500 


99.4577 


99.4488 


4.6726 


0.1600 


99.5701 


99.5638 


5.9360 


0.1700 


99.6596 


99.6559 


6.0449 


0.1800 


99.7309 


99.7287 


7.3198 


0.1900 


99.7877 


99.7851 


9.7769 


0.2000 


99.8322 


99.8305 


12.0368 


0.2100 


99.8677 


99.8663 


14.3319 


0.2200 


99.8955 


99.8946 


17.5088 


0.2300 


99.9173 


99.9167 


21.1197 


0.2400 


99.9344 


99.9340 


25.3890 


0.2500 


99.9477 


99.9474 


29.9903 


0.2600 


99.9581 


99.9578 


35.6695 


0.2700 


99.9662 


99.9658 


40.5849 


0.2800 


99.9726 


99.9724 


45.6628 


0.2900 


99.9776 


99.9776 


49.8262 


0.3000 


99.9816 


99.9817 


61.3741 


0.3100 


99.9849 


99.9852 


73.4072 


0.3200 


99.9875 


99.9878 


87.6728 


0.3300 


99.9897 


99.9899 


112.9891 


0.3400 


99.9914 


99.9917 


121.8812 


0.3500 


99.9929 


99.9932 


134.6083 



The second term satisfies 

(6-4) ^&3> l= o} < 5^&l{i/ i= o} < °°> 

1=1 8=1 

almost surely, because the number of reinitialization remains finite and con- 
verges to zero if normalized by 4>(W n ). We can therefore just drop the second 
term in (6.3) and assume that 

regardless whether v n ^ holds or not. Let K, be an arbitrary compact set. 
Assumption (4.33) implies 

(6.5) w^e^y-^fix)^!^ < C(diam(/C)) (? - 1 ^(x) , ?- 1 |/(^)| ,? - 
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By Holder's inequality, we have for q < p and all 6 S /C, 

m qJ (e)i {eeK} = Ee [w x (ey- 1 \f(x)\n {ee)C} } 

(6.6) < ||/||^ 0ip C(diam(/C))' ? - 1 E, [W(X)^- 1 )/^)] (p - 9)/9 

< oo 

as long as q satisfies the condition 

P(P* + 1) 



(6.7) q< 

p + p* 

Note that (6.7) implies also q < p because p ^ + pP < p for p> 1. Condition 
(4.63) implies 

(6.8) rn q j(9) < oo VI < q < 4. 

Let K, be a compact neighborhood of 0*. Lebesgue's theorem together with 
the continuity condition (4.64) and the upper bound (6.5), which is inte- 
grable by (6.6), shows that 

(6.9) m Lq :6^m f , q (9) 

is continuous for q < 4. Without loss of generality, we assume from now on 
that E[/(X)] = 0. By assumptions (4.33) and (4.34), we have for q < p and 
a > 1 

E[w Xn (9n-i) q \f(X n )\il {Koo=:j} \ 

= E eo [w Xn (e n -i) q - 1 \f(Xn)\ q i {Koo=:j} ] 

<ll/ll^ , p ^ [^(^-l) p((? - 1)/(p " 9) l {ftoo =, } ] (p - 9)/p 
<\\f\\l hP n^=j) ip - q)/pl/a , 

< (7(p)0-3)/pi/o e V(p-9)/P 1 / a 

X fn(P~9)/pl/a+p(g-l)/(p-(j)m J ,, log p (e)y'||^|ig-l < 

VP / II 1100:91 ' 

if Qi < P* holds for 

(6.10) S1 = ?fci> « 

p — g a — 1 

We may choose a arbitrarily large at the expense of increasing the constant 
in the above estimate. Therefore, q\ < p* holds if 

(6.11) P{q - l) <f 
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which is equivalent to (6.7). Next, we choose 

(6.12) p<e -o.m p *p(q-l)/(p- q )^ 

such that we can sum over j = 1, . . . , oo to obtain 

m /i9 (^_i)=E[^ n (^-i) 9 |/(^)| 9 |^n-i] 

oo 

= Y j nUX n (On-l)V(Xn)\ g l {Kao=]} \ T n -l\ 
3=1 

< C(p,a,p,p*,q, \\f\\e , P , \\W\\e 0lP *) 
with an upper bound independent of n. Assumption (4.63) implies that 

(6.13) supE[m /ig (0 n _i)] < oo VI < q < 4. 

n 

We have 

n 

(6.14) (M) n = ^(mf^Oi-x) - nf(X)} 2 ). 

i=l 

Because Q^rrif^iQ) is continuous at 0* and -^6* almost surely, we 
obtain from Cesaro's lemma that 



(M) n _ 1 



-Y,(m f M^i)-nf(X)] 2 ) -4 m f , 2 (9*)-E[f(X)] 2 = a 2 f (e% 



n n 

i=l 

almost surely. By (6.13) and Lebesgue's dominated convergence theorem, 



(6.i 5 ) °A=mM^c }K 

n n 1 

Set 

n 

Mn = ^fe 2 -E[^|^ t -l]). 
i=l 

By (6.13) M n is a square integrable martingale because 

E[AM 2 | Fi-x] =E[(e 2 - E[e 2 | F-x]) 2 | Ti-x] 
<8( ? n /i4 (^-i)+E[/(X)] 4 ). 

More precisely, 

E[AA? 2 | Fi-x] = mf A (0i-x) - m^O^) 2 - 4m /)3 (^-i)E[/(X)] 
+ 8 ? n /i2 (^_ 1 )E[/(X)] 2 - 4E[/(X)] 4 . 
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The continuity of 9 \— > mf tP (6) in 9* for 1 < p < 4 and Cesaro's lemma imply 
1 n 

-J^E[AJWf|Ji_i] 

n — ^ 



n . 



(6.16) -»• m />4 (0* ) - m />2 (r ) 2 - 4m /i3 (0*)E[/(X)] 

+ 8m /)2 (r)E[/(X)] 2 -4E[/(X)] 4 , 
almost surely, which together with (6.15) implies 

n 1 " 

(6.17) lim s~ 2 ]T(£ 2 ~ ntf I ^-i]) = I™ - - E[£ 2 | Ti-i]) = 0, 

i=l i=l 

almost surely. Therefore, 

(6.18) lim = 1+ lim ( M») "'i - E [£ | ^-1]) = 1, 

n->oo {M) n n^oo\ n J 71 7^ 

almost surely. To apply Corollary 4.2 in [15], we need to verify the three 
conditions: 

(6.19) s~ 2 [M] n ^r] 2 >0 almost surely; 

CO 

(6.20) \fe>0 XX'EN&ll^i^}] <oo; 

n=l 

00 

(6.21) 35 >0 ^ S ; 4 E[|e„| 4 l { |^|< 5M ]<oo. 

n=l 

Condition (6.19) holds because 



n— >co g2 n— >oo \ n J 



(6.22) + lim ^(£ 2 - E[£ 2 I JVi]) 



1=1 



almost surely, as a consequence of (6.15) and (6.17). By (6.15), we may 
replace s 2 by n for the verification of conditions (6.20) and (6.21). 

Let 1 < a < 2. We first approach (6.20). From Holder's and Chebyshev's 
inequalities, we have 

n _1 E[|e„|l { | en | >£v ^ } ] 

< ^ _1 E[|^| 2a ] 1/(2a) P(|^| > E^) 1 " 11 ^ 
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(6.23) 



< ^- 1 E[\u\ 2a ] 1/(2a) mn\ 2a } 1 ' 1 



/(2a) 



\ 2a-l 



< e 1 - 2a E[\S n \ 2a ]n- a 
for every fixed e > 0. Therefore, 



(6.24) ^^~ 1 E[|^|l { | eri | >eV ^]< £ 1 - 2a ^E[|^| 2a ]n- a <oo. 



r 1 ] 

n=l n=l 



This last equation implies condition (6.20). To check condition (6.21), note 
that 

oo oo 

E^ 2e [|^i 4 i { | 5 „i<^ } ] < E^ 2]E [(^) 4 " 2a i^i 2ai {i^<^}] 

n=l n=l 

(6.25) 

OO 

< < J 4 - 2a ^ n -°E[|£ n | 2a ] <oo. 

n=l 

The sums (6.24) and (6.25) are finite because 

(6.26) E[\U\ 2a | Jn-i] < 2 2a - 1 (m L2 M + E[f(X)] 2a ), 
and sup n E[m/ 5 2a(^n)] < oo, as shown in (6.13). 

6.2. Proof of Theorem 4-2. Under the assumptions of Theorem 4.2, the 
boundedness of the functions 

(6.27) fv = l(-oo, y ]°% 

allows us to apply the law of iterated logarithm (Theorem 4.4). We verify 
the convergence statement by proving that 

(6.28) P(?n,«M,(a) < q a ~ S i.o.) = V5 > 0, 
and 

(6.29) F(q n<WjU (a) > q a i.o.) = 0, 
where i.o. stands for infinitely often and is defined as 

oo oo 

(6.30) A n i.o. = limsup A n = f] M A k . 

n=l k=n 

Let F(y) = P(Y < y) denote the distribution function of Y = Hf(X). We first 
analyze the estimator q n ,w.u( a )- Define 

(6.31) A n (5) = {q ntW , u (a) <q a - 5}. 



30 D. EGLOFF AND M. LEIPPOLD 

It follows from (4.26) and Lemma 6.1 that 

l {Yi<q a -S} > « 



(6.32) 



Let 



A n(&) = { 1 V Wil { 



^^{wi^iYiKq^-S} ~F{q a - 5)) > u(n)a^Wi-nF(q c 



(6.33) W n ( V ) = | 

i 

We consider 



< (1 + rj)(j>(nv 



(6.34) 
Then 



A n (8)^A n {5)nW n {r,)utW n {rj). 



A n (8)nW n ( V ) 
(6-35) C feCtOiV^fo-a} - F(q a - 8)) 

^ i 

> is(n)a(n — (1 + rj)<p(nv)) — nF(q a — 5) 

Similarly, we have 

B n = {q n ,w,u(a) > q a } 



(6.36) 



and 



t\Yl( Wil {Yi<q a } ~ F (Q*)) < v(n)aJ2wi -nF(q a )\ 



B n nw n (7]) 

(6-37) C l^Wil^^ - F(q a )) 

i 

< u(n)a(n + (1 + rj)(/)(nv)) — nF(q a )\. 

For arbitrary 77 > 0, let 

(6.38) A^ L (5, v ) = {X^il^,,,-*} - F(q a - 5)) > (1 + rj)4>{nv qa - S 
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and 

(6.39) B^{n) = |^Kl { Y,< fe} - F(q a )) < -(1 + r?)0(n^)|. 



Then 

(6.40) 
and 

(6.41) 



— —4>(nv qa _s) + — —n < v(n)(n - (1 + r))<j)(nv)) 
a a 

=> A n (S)nW n ( V )cA^(6,r,) 



u(n)(n + (1 + rj)4>(nv)) < — ^— ^-ra — -cf)(nv c 

a a 



=^ B n nw n ( v )cB™ L (6, v ). 

Recall that 

(6.42) limsup(A n U B n ) = limsup A n U limsupl? n . 

n n n 

Hence, if (6.40) is satisfied, we have 

P(A n (5) i.o.) < F(A n (5) n W n (rj) i.o.) + P(CW„(tj) Lo.) 

(6.43) 

<F(A^ L (S lV ) i.o.)+P(CW n fa) i.o.). 
From the law of iterated logarithm in Theorem 4.4, we know that 

(6.44) P(^ IL (5, rj) i-o.) = 0, P(CW„(??) i.o.) = 0. 
Therefore, ¥(A n (5) i.o.) = for all 5 > 0. In the same way, we obtain 

(6.45) F(B n i.o.) = 0. 

To verify that condition (4.36) implies (6.41) and (6.41), it is sufficient to 
note that F(q a ) > a. Because -F(g Q — 5) < a, it follows that 

i±^( nt , to _,) + F{qa ~ 6) n <n- kn 1 ^' 

for n large enough and for all 5 > 0. 

The convergence proof for w v (a) is slightly simpler. From (4.29) and 
Lemma 6.1, we get 

A n{S)={ql w Ja)<qa-^} 



.46) 



{B^W*}-! 1 -^-*))) 

<i/(n)(l-a)-n(l-F(g a -$)) 
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and 



3-47) =lj2^il{Y i>qct }-(l-F(qa))) 



>u(n){l-a)-n(l-F{q a ))j. 

For arbitrary i] > 0, let 

A r n L1L (S,v) = {j>il{y i>fa -*} - (1 - F(q a - 8))) 

(6.48) 
and 

(6.49) B% hlh ( V ) = { J>il { y i>ffa} - (1 - F( fe ))) > (1 + 
We have 

1 - F(q a - 6) 1 + 7? v 

n n ) < ; n ~ ~, 4>{nv qa _ 5 ) 

1 — a 1 — a 

(6.50) 



and 



(6.51) ^W + ^lnOW B r n cB r n LlL ( V ). 
1 — a 1 — a 

By the law of iterated logarithm (Theorem 4.4), we obtain 

(6.52) P«' LIL (M) i.o.) =0,P(^ LIL ( ? ?) i-o.) = 0. 

Therefore, conditions (6.50) and (6.51) are sufficient to guarantee (6.28) and 
(6.29) for q r nwv - Because 1 — F(q a ) < 1 — a and 1 — F(q a — 5) > 1 — a for 
all 5 > 0, condition (4.38) is sufficient for (6.50) and (6.51). 
In a completely analogous manner, we obtain 

^) = {?w(«)<3a-^} 

(6.53) 

= ^Yl^ Wil { Y ^<1c- S } ~ F ( qa ~ 5 )) - V ( n ) a ~ nF {la - 5) I 
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and 
(6.54) 

= | ^K^r^fe} - F (Qcx)) < v{n)a - nF{q a ) \ . 

i 

This time, let, for r/ > 0, 

(6.55) A^(S,n) = fetfliV^fo-a} - F(g a " ^)) > (1 + rM™W)} 
and 

(6.56) ^ LIL (r?) = [X>il{is< go } - F(g a )) < -(1 + ??)<^o}- 
We have 

(6 .57) l±^0(n^) + F(ga ~ J) n<K^) =► A l n {5) C ^ LIL (5, r?) 

a a 



and 



(6.58) !/(„)< £te2i2n - i±^(nu a ) B l n c ^ LIL (r?) . 

a a 

By the law of iterated logarithm, equations (6.57) and (6.58) are a sufficient 
condition to guarantee (6.28) and (6.29) for q l n w v . Similarly as above, (4.40) 
is sufficient for (6.57) and (6.58). This proves Theorem 4.2. 

6.3. Proof of Theorem 4-1- We again apply the law of iterated logarithm 
4.4. We only prove the result for q n ,w,v{ot)- The other estimators are treated 
analogously. Because F is increasing in q a , it follows that F^ is continuous 
in a. It is sufficient to prove for any 5 > that 

(6.59) P(g w (a) <q a -s i.o.) = 

and 

(6.60) F(q„(a) >q a + 6 i.o.) = 0. 
For v(n) = 1, we obtain from (6.40), 



.61) 



(p(nv qa -s) H n + (1 + r})<p(nv) < n 

a. a 

=> A n (S)nW n (r,)cA^ L (5,r l ). 
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(6.62) 



we deduce 



(6.63) 



n < 



^2(wil {Yl < qa+ 8} - F(q a +5)) Ka^Wi- nF(q a + 8) L 

i i 

n - (1 + f])4>{nv) 9{nv qa+ s) 

a a 

=> B n (5)nW n ( V )c 3^(5,7]). 



For any 5 > 0, F(q a — 5) < a and F(q a + 5) > a. Therefore, if n is large 
enough, conditions (6.61) and (6.63) are satisfied. We conclude as in the 
proof of Theorem 4.2. 

6.4. Proof of Proposition 4-1- Let fC be a compact subset of W. We apply 
Markov's and Burkholder's inequality, 



max||5i fc(7,e,£)|| > 5 

k<n 



- Sp 



1 {n<<r(lC,e)}'^2lk\\ H ( X k,8k-l) ~ h{6k-l) 



p/2- 



k=l 



< 



2 P B r , 



J2lknHk-X<n<v(K,e)}(\\H(Xk,9k-l)\\ P 



ll^*-l)l| P )] a/P 



p/2 



where B p is a universal constant only depending on p. To estimate 

m { k-l<n<a(K,e)}m(X k ,e k ^W)} 2/P , 
note that by our assumptions 



E [l{fc-i< ( r(K:,e)}l E e fc _ 1 [\\H(Xk,0k- 



l2/P 



E 



we^iX^WPiXk) 

CMMk-l<*(K,e)}®6 [W*>(X k )]f P < C 2 K \\W\\l hp 
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where the constant Cjc comes from assumption (4.54). Because h is contin- 
uous and K. compact, l{k~i<cr(K,e)}\\h(0k-i)\\ is bounded as well. Therefore, 
we arrive at the estimate 

1 / n 

p(max ||5i, fc ( 7 , e,K)\\ > 5) < C-l ]T 7 f 

~ n \k=l 

The bound 

n / \ p 

is derived similarly as in the proof of Proposition 5.2 in [1] . 
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